ABSTRACT

In the unlabeled data problem, data contains signals from various
sources whose identities are not known a priori, yet the parameters
of the individual sources must be estimated. To do this optimally,
it is necessary to optimize the data PDF, which may be modeled as a
mixture density, jointly over the parameters of all the signal models.
This can present a problem of enormous complexity if the number of
signal classes is large. This paper describes an algorithm for jointly
estimating the parameters of the various signal types, each with
different parameterizations and associated sufficient statistics. In
doing so, it maximizes the likelihood function of all the parameters
jointly, but does so without incurring the full dimensionality of the
problem. It allows lower-dimensional sufficient statistics to be
utilized for each signal model, yet still achieves joint optimality.
It uses an extension of the class-specific decomposition of the Bayes
minimum error probability classifier.

1.  INTRODUCTION 

In many real-world problems, a variety of co-existing signal types are
embedded in noise, and the exact nature or classification of the
signals as they arrive at the sensor is unknown. This is sometimes
known as the unlabeled data problem. However, it is often necessary to
obtain statistical characterizations, or probability density functions
(PDF’s), of the individual signal types. We can envision two general
types of problems where this is necessary. In the first type of problem
(Type I), the various signal types are considered to be a background
that is distinguished from some desired signal type. In active sonar,
for example, there is a large variety of reflecting boundaries in the
ocean that must be distinguished from the reflection from a ship or
submarine. In such problems, the sub-classification of the exact
background signal type is not important, but the PDF of the background
is necessary in order to obtain optimal classification performance
against the desired signal. In the second type of problem (Type II),
the PDF’s of the individual sub-classes of background are important,
but it is not practical to manually classify the data for the purposes
of estimating the individual PDF’s, i.e. the training data is unlabeled.

We present below an algorithm that solves the global problem of
estimating the parameters of the PDF’s of the individual models from
the unlabeled data. It does so in the global maximum likelihood sense,
but without the high-dimensional search. Suppose K samples of
signal-plus-noise are observed from the mixture PDF

$$p(X; \Lambda) = \sum_{m=1}^{M} p(X|H_m; \lambda_m)\, p(H_m) \qquad (1)$$

where $X = \{X_k\}_{k=1}^{K}$, $X_k \in \mathbb{R}^N$, and $H_m$ is a
distinct signal class (or PDF) of data parameterized by
$\lambda_m \in \Lambda_m$. The collection of all parameters is denoted
$\Lambda$, where

$$\Lambda = \{p(H_m);\ \{\lambda_m\}_{m=1}^{M}\}; \quad \Lambda \in \boldsymbol{\Lambda}.$$

The joint maximum likelihood estimate is the $\Lambda$ which attains

$$\max_{\Lambda} \prod_{k=1}^{K} p(X_k; \Lambda). \qquad (2)$$

There are two aspects of this problem that threaten to make it
impossible to solve in practice. First, the various terms in the
mixture PDF interact, so it involves a search for a single maximum
point in the high-dimensional space $\boldsymbol{\Lambda}$. This is in
contrast to the labeled data problem, which consists of a set of
simpler problems

$$\max_{\lambda_m} \prod_{k=1}^{K} p(X_k^m; \lambda_m), \qquad (3)$$

for each m, where $X_k^m$ are data samples known to be from class m.
Second, (3) may be aided by reduction of X to a sufficient statistic
$Z_m = T_m(X)$ appropriate for each signal class. This cannot be done
in (2) unless the sufficient statistics are the same for all m.
We now solve both of these problems.

2.  ASSUMPTIONS  AND  MATHEMATICAL  RESULTS 
The following two assumptions are needed for the main results:

Assumption 1 Let $H_0$ be a noise-only class with PDF written
$p(X|H_0; \lambda_0)$. Within $\Lambda_m$, for each m, there exists the
same "noise-only" condition $\lambda_m^0$, i.e.,

$$\lim_{\lambda_m \to \lambda_m^0} p(X|H_m; \lambda_m) = p(X|H_0; \lambda_0).$$

Assumption 2 Suppose for each class PDF, $p(X|H_m)$, there exists a
sufficient statistic for the parameter $\lambda_m \in \Lambda_m$. Denote
this sufficient statistic by $Z_m = T_m(X)$.

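Applying the above two assumptions, we have

$$\frac{p(X|H_m; \lambda_m)}{p(X|H_0; \lambda_0)} = \frac{p(Z_m|H_m; \lambda_m)}{p(Z_m|H_0; \lambda_0)}$$

by sufficiency. Therefore,

$$\frac{p(X; \Lambda)}{p(X|H_0; \lambda_0)} = \sum_{m=1}^{M} p(H_m)\, \frac{p(Z_m|H_m; \lambda_m)}{p(Z_m|H_0; \lambda_0)}. \qquad (4)$$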
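
The sufficiency step can be spelled out with the Neyman-Fisher factorization. The following is only a sketch under that standard theorem; the factor functions $g_m$, $h_m$, $c_m$ and the symbol $\lambda_m^0$ (the noise-only parameter value of Assumption 1) are notation assumed here for illustration and are not taken from the text.

```latex
% Factorization of the class-m family through the statistic T_m:
p(X \mid H_m; \lambda_m) = g_m\bigl(T_m(X); \lambda_m\bigr)\, h_m(X)

% Assumption 1: the noise-only PDF is the member \lambda_m^0 of the same family:
p(X \mid H_0; \lambda_0) = g_m\bigl(T_m(X); \lambda_m^0\bigr)\, h_m(X)

% The PDF of Z_m = T_m(X) carries the same g_m, times a factor c_m free of \lambda_m:
p(Z_m \mid H_m; \lambda_m) = g_m(Z_m; \lambda_m)\, c_m(Z_m)

% h_m cancels in the first ratio and c_m in the last one:
\frac{p(X \mid H_m; \lambda_m)}{p(X \mid H_0; \lambda_0)}
  = \frac{g_m(Z_m; \lambda_m)}{g_m(Z_m; \lambda_m^0)}
  = \frac{p(Z_m \mid H_m; \lambda_m)}{p(Z_m \mid H_0; \lambda_0)}
```

Multiplying each ratio by $p(H_m)$ and summing over m, with $p(X; \Lambda)$ given by the mixture PDF, then yields (4).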
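Define the likelihood function to be

$$\mathcal{L}(X; \Lambda) = \frac{\prod_{k=1}^{K} p(X_k; \Lambda)}{\prod_{k=1}^{K} p(X_k|H_0; \lambda_0)} = \prod_{k=1}^{K} \sum_{m=1}^{M} p(H_m)\, \frac{p(Z_{m,k}|H_m; \lambda_m)}{p(Z_{m,k}|H_0; \lambda_0)}, \qquad (5)$$

where $Z_{m,k} = T_m(X_k)$, $\Lambda = \{p(H_m); \{\lambda_m\}_{m=1}^{M}\}$, and
$\lambda_0$ is presumed to be known.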
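
Since (5) is the quantity monitored for convergence, it may help to see how its logarithm could be evaluated without underflow. The following is a minimal sketch, assuming the per-class log densities of the statistics have already been computed; the function name and argument layout are hypothetical and not part of the paper.

```python
import math

def log_likelihood_ratio(log_p_sig, log_p_noise, priors):
    """Log of (5): sum over k of log sum over m of p(H_m) * ratio_km.

    log_p_sig[k][m]   : log p(Z_mk | H_m; lambda_m)  (assumed precomputed)
    log_p_noise[k][m] : log p(Z_mk | H_0; lambda_0)  (assumed precomputed)
    priors[m]         : p(H_m), strictly positive and summing to one
    """
    total = 0.0
    for sig_row, noise_row in zip(log_p_sig, log_p_noise):
        # log of each mixture term p(H_m) p(Z|H_m) / p(Z|H_0)
        terms = [math.log(p) + s - n
                 for p, s, n in zip(priors, sig_row, noise_row)]
        # log-sum-exp over the classes, shifted by the largest term
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total
```

Working with log densities keeps each sample's contribution finite even when the individual PDF values are extremely small, which is the usual reason for the log-sum-exp shift.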

The objective is to estimate $\Lambda$ using the EM algorithm. It is
shown in another section that each iteration consists of maximizing

$$Q(\Lambda; \Lambda') = \sum_{k=1}^{K} \sum_{m=1}^{M} \left[ \log p(H_m) + \log p(Z_{m,k}|H_m; \lambda_m) - \log p(Z_{m,k}|H_0; \lambda_0) \right] \gamma_{km}(Z_{m,k}, \Lambda') \qquad (6)$$

over $\Lambda$, where

$$\gamma_{km}(Z_{m,k}, \Lambda') = \frac{\dfrac{p(Z_{m,k}|H_m; \lambda_m)}{p(Z_{m,k}|H_0; \lambda_0)}\, p(H_m)}{\sum_{l=1}^{M} \dfrac{p(Z_{l,k}|H_l; \lambda_l)}{p(Z_{l,k}|H_0; \lambda_0)}\, p(H_l)} \qquad (7)$$

with all quantities on the right-hand side evaluated at the current
parameter estimate $\Lambda'$. From here on, we drop the notational
dependence on $Z_{m,k}$, $\Lambda'$, simplifying the notation to
$\gamma_{km}$. The M-step produces estimates of $p(H_m)$:

$$p(H_m) = \frac{1}{K} \sum_{k=1}^{K} \gamma_{km}. \qquad (8)$$

2.1. Discussion

Notice that in (6), the maximization of $Q(\Lambda; \Lambda')$ with
respect to $\Lambda$ requires the functions

$$Q_m(\lambda_m; \Lambda') = \sum_{k=1}^{K} \log p(Z_{m,k}|H_m; \lambda_m)\, \gamma_{km} \qquad (9)$$

to be independently maximized over $\lambda_m$, for each m. This is in
contrast to (5), which contains mixed terms and requires joint
maximization. Equation (9) is, in effect, a probabilistic weighting of
each data sample, a minor modification of the individual maximum
likelihood estimators represented by (3). If an algorithm already
exists for maximization of $\sum_{k=1}^{K} \log p(Z_{m,k}|H_m; \lambda_m)$,
then this algorithm may be used with a minor modification. One way to
do it, albeit impractical, would be to scale $\gamma_{km}$ by a large
constant C, round to an integer $n_{km} = \lfloor C \gamma_{km} \rfloor$,
then form a larger data set composed of a replication of each data
sample $Z_{m,k}$ by the corresponding integer:

$$Z' = \{ \{Z_{m,1} \cdots Z_{m,1}\}, \ldots, \{Z_{m,K} \cdots Z_{m,K}\} \},$$

then maximize

$$\sum_{k=1}^{K'} \log p(Z'_k|H_m; \lambda_m)$$

where $K' = \sum_{k=1}^{K} n_{km}$. A more practical, yet suboptimal,
method would be to threshold $\gamma_{km}$ and include only those data
samples in X whose weights exceed the threshold. Of course, the best
approach would be to integrate the weighting directly in the algorithm.
If an EM algorithm is used within the ML estimator for $\lambda_m$, it
is straightforward. Below, we give an example in which
$p(Z_{m,k}|H_m; \lambda_m)$ are Gaussian mixtures. The result is an EM
algorithm operating simultaneously on M different feature spaces.

3. AN EM ALGORITHM FOR A NON-HOMOGENEOUS
GAUSSIAN MIXTURE

We now show how to integrate the data weighting $\gamma_{km}$ into the
update equations for the EM algorithm for the individual class PDF’s
using an example where $p(Z_m|H_m)$ are Gaussian mixtures. Consider a
Gaussian mixture for $Z_m \in \mathbb{R}^{n_m}$ under class m

$$p(Z_m|H_m) = \sum_{i=1}^{L_m} \alpha_{mi}\, \mathcal{N}_{mi}(Z_m) \qquad (10)$$

where

$$\mathcal{N}_{mi}(Z_m) = (2\pi)^{-n_m/2}\, |\Sigma_{mi}|^{-1/2} \exp\left\{ -\tfrac{1}{2} (Z_m - \mu_{mi})^T \Sigma_{mi}^{-1} (Z_m - \mu_{mi}) \right\}.$$

By substituting (10) into (4), we have an expression for $p(X|\Lambda)$
that is a kind of mixture of mixtures, with each sub-mixture a function
of a different sufficient statistic. We call this a non-homogeneous
Gaussian mixture. The update procedure for the parameters

$$\Lambda = \{ p(H_m),\ \alpha_{mi},\ \mu_{mi},\ \Sigma_{mi} \}$$

is provided in Table 1.

It may be noted that the update equations in Table 1 are identical to
the usual Gaussian mixture EM updates except for the inclusion of
$\gamma_{km}$ in the first step. Notice also that because of the way
$w_{i,k}$ are normalized, the data weights $\gamma_{km}$ may be
arbitrarily scaled without changing the algorithm.

Because of possible numerical issues, it may be necessary to add a
constant to the diagonal elements of $\Sigma_{mi}$ at each iteration.
These constants may be regarded as prior knowledge in the form of
independent measurement error variances, and should be chosen
carefully. This method has been employed with much success.


4.  SIMULATION  RESULTS  (7-CLASS  EXAMPLE) 

The example problem to be discussed here is a subset of the 9-class
synthetic problem discussed in a previous paper [1]. We consider
the following 7 data classes denoted $H_1, \ldots, H_7$.

•  Class $H_0$: Noise only

•  Class $H_1$: Long Sinewave

•  Class $H_2$: Medium Sinewave

•  Class $H_3$: Short Sinewave

•  Class $H_4$: Long Gaussian Signal

•  Class $H_5$: Short Gaussian Signal

•  Class $H_6$: Short Impulse Signal

•  Class $H_7$: Long Impulse Signal

The sufficient statistics are tabulated in Table 2 and their
distributions under $H_0$ are listed in Table 3.

Further details of the signal types and their distributions under
$H_0$ may be found elsewhere [2].
•  Repeat for m = 1, ..., M:

1.  Compute data weights

$$w_{i,k} = \gamma_{km}\, \frac{\alpha_{mi}\, \mathcal{N}_{mi}(Z_{m,k})}{\sum_{j=1}^{L_m} \alpha_{mj}\, \mathcal{N}_{mj}(Z_{m,k})},$$

for $i = 1, \ldots, L_m$, $k = 1, \ldots, K$.

2.  Let

$$a_i = \sum_{k=1}^{K} w_{i,k},$$

for $i = 1, \ldots, L_m$.

3.  Update the means

$$\mu_{mi} = \frac{1}{a_i} \sum_{k=1}^{K} w_{i,k}\, Z_{m,k},$$

for $i = 1, \ldots, L_m$.

4.  Update the covariances

$$\Sigma_{mi} = \frac{1}{a_i} \sum_{k=1}^{K} w_{i,k}\, (Z_{m,k} - \mu_{mi}) (Z_{m,k} - \mu_{mi})^T,$$

for $i = 1, \ldots, L_m$.

5.  Update mode weights

$$\alpha_{mi} = \frac{a_i}{\sum_{j=1}^{L_m} a_j},$$

for $i = 1, \ldots, L_m$.

End

•  Use (7) to update $\gamma_{km}$ for all k, m.

•  Use (8) to update $P(H_m)$ for all m.

Table  1:  Update  Equations  for  Non-Homogeneous  Mixture 
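
The steps of Table 1, followed by the updates (7) and (8), can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the list-of-arrays data layout, and the `ridge` value added to the covariance diagonals are all hypothetical choices made here, and the noise-only log densities `log_p0` are assumed to be supplied by the caller.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    peak = a.max(axis=axis, keepdims=True)
    return np.squeeze(peak, axis=axis) + np.log(np.exp(a - peak).sum(axis=axis))

def log_gauss(Z, mu, Sigma):
    """Row-wise log N(Z_k; mu, Sigma) for a (K, n) array Z."""
    diff = Z - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ki,ij,kj->k", diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (Z.shape[1] * np.log(2.0 * np.pi) + logdet + maha)

def log_modes(Z, alpha, mu, Sigma):
    """(K, L_m) array of log(alpha_mi * N_mi(Z_mk))."""
    cols = [np.log(a) + log_gauss(Z, m, S) for a, m, S in zip(alpha, mu, Sigma)]
    return np.stack(cols, axis=1)

def em_iteration(Z, log_p0, priors, gamma, alpha, mu, Sigma, ridge=1e-6):
    """One pass of Table 1 followed by (7) and (8).

    Z[m]: (K, n_m) features; log_p0[k, m] = log p(Z_mk | H_0; lambda_0);
    alpha[m]: (L_m,); mu[m]: (L_m, n_m); Sigma[m]: (L_m, n_m, n_m).
    ridge is an assumed diagonal loading constant, not a value from the paper.
    """
    alpha, mu, Sigma = list(alpha), list(mu), [S.copy() for S in Sigma]
    log_pm = np.empty_like(gamma)
    for m, Zm in enumerate(Z):
        lt = log_modes(Zm, alpha[m], mu[m], Sigma[m])
        resp = np.exp(lt - logsumexp(lt, axis=1)[:, None])
        w = gamma[:, [m]] * resp                      # step 1: weighted responsibilities
        a = w.sum(axis=0)                             # step 2
        mu[m] = (w.T @ Zm) / a[:, None]               # step 3
        for i in range(a.size):                       # step 4, with diagonal loading
            d = Zm - mu[m][i]
            Sigma[m][i] = (w[:, [i]] * d).T @ d / a[i] + ridge * np.eye(d.shape[1])
        alpha[m] = a / a.sum()                        # step 5
        log_pm[:, m] = logsumexp(log_modes(Zm, alpha[m], mu[m], Sigma[m]), axis=1)
    log_post = np.log(priors) + log_pm - log_p0       # log numerator of (7)
    gamma = np.exp(log_post - logsumexp(log_post, axis=1)[:, None])
    priors = gamma.mean(axis=0)                       # (8)
    return priors, gamma, alpha, mu, Sigma
```

Each class keeps its own feature array, so the loop body only ever touches the statistic of class m; the classes interact solely through the weights `gamma`, which is the point of the class-specific decomposition.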

4.1. Data Set

To  simulate  a  data  set  from  a  mixture  of  the  seven  data  classes, 
an  equal  share  of  1024  samples  from  each  data  class  were  created. 
Each  input  data  sample  was  a  time-series  of  256  data  points.  The 
samples  were  then  joined  together  into  a  single  data  set.  The  true 
class  index  of  each  sample  was  not  used  by  the  algorithm,  but  was 
remembered  for  use  later  in  validation.  For  each  input  data  sample, 
the  sufficient  statistics  were  computed  for  all  class  indexes. 

4.2.  Algorithm  Initialization 

Initial values of $\mu_{mi}$ were set equal to randomly chosen input
data samples. Initial values of $\Sigma_{mi}$ were set equal to the
sample covariance of the entire data set. Initial values of
$\alpha_{mi}$ were all equal, as were the initial values of $P(H_m)$.
The number of Gaussian mixture components per data class was 10.
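
The initialization just described can be sketched for a single class as follows. This is an illustrative sketch only: the function name is hypothetical, the statistics are assumed to be stored as a (K, n_m) NumPy array, and drawing the initial means without replacement is an assumption made here rather than something the text specifies.

```python
import numpy as np

def init_class_mixture(Zm, n_modes=10, rng=None):
    """Initial (alpha, mu, Sigma) for one class from its (K, n_m) feature array."""
    rng = np.random.default_rng() if rng is None else rng
    K, n = Zm.shape
    # Means: randomly chosen data samples (distinct indices, by assumption)
    picks = rng.choice(K, size=n_modes, replace=False)
    mu = Zm[picks].copy()
    # Covariances: sample covariance of the entire data set, shared by all modes
    cov = np.atleast_2d(np.cov(Zm, rowvar=False))
    Sigma = np.repeat(cov[None, :, :], n_modes, axis=0)
    # Mode weights: all equal
    alpha = np.full(n_modes, 1.0 / n_modes)
    return alpha, mu, Sigma
```

The class priors would be initialized the same way as the mode weights, for example with `np.full(M, 1.0 / M)`.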


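$$Z_1 = \left[\sum_{i=1}^{N} x_i \cos(\omega i)\right]^2 + \left[\sum_{i=1}^{N} x_i \sin(\omega i)\right]^2$$

$$Z_2 = \left[\sum_{i=1}^{N/2} x_i \cos(\omega i)\right]^2 + \left[\sum_{i=1}^{N/2} x_i \sin(\omega i)\right]^2$$

$$Z_3 = \left[\sum_{i=1}^{N/4} x_i \cos(\omega i)\right]^2 + \left[\sum_{i=1}^{N/4} x_i \sin(\omega i)\right]^2$$

$$Z_4 = \sum_{i=1}^{N} x_i^2$$

$$Z_6 = \log(x_1^2)$$

$$Z_7 = \log(x_1^2 + x_2^2)$$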


Table  2:  Class-Specific  Statistics 


4.3.  Algorithm  Performance 

Algorithm performance may be measured by monitoring the likelihood
function (5). Notice also that $\gamma_{km}$ in (9) acts as a
probabilistic data weighting for each sample. It is in effect an
estimate of the probability that data sample k is from class m. If the
algorithm is working properly and $p(Z_m|H_m)$ are converging to the
true PDF’s, $\gamma_{km}$ should act as data classifiers. Thus, for a
given sample k, maximizing over m will produce a guess as to the class
index of the sample. But this will not work in general. Specifically,
if two data classes have the same or equivalent sufficient statistics,
the algorithm has no way to make the separation between the classes
except perhaps as different Gaussian mixture components within a fixed
m. This shortcoming of the algorithm is expected since it is designed
only to estimate the PDF of the overall non-homogeneous mixture (Type I
problems). The separation of the sub-classes is irrelevant to its
operation. However, if all the sufficient statistics are different, it
has a chance of accomplishing this goal (Type II problems). In this
example, we monitor the algorithm performance as a Type II problem by
determining the probability of correct classification ($P_{cc}$).
$P_{cc}$ was computed as the percentage of the data that was classified
correctly (i.e. when $\arg\max_m \gamma_{km}$ was equal to the true
class index).
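
The correct-classification measure described above amounts to an arg max over the class weights of each sample. A minimal sketch, assuming the weights are stored row-wise per sample and that class indices are the zero-based column positions (both are conventions chosen here, and the function name is hypothetical):

```python
def prob_correct_classification(gamma, true_class):
    """Fraction of samples whose arg max over m of gamma[k][m] equals the true index.

    gamma[k][m]   : class weight of sample k for class m
    true_class[k] : true class index of sample k (zero-based column index)
    """
    hits = 0
    for weights, truth in zip(gamma, true_class):
        guess = max(range(len(weights)), key=weights.__getitem__)
        hits += (guess == truth)
    return hits / len(true_class)
```

The function returns a fraction between 0 and 1; multiplying by 100 gives the percentage form used in the text.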

The algorithm was allowed to iterate 380 times. At each iteration,
the total likelihood as well as the probability of correct
classification ($P_{cc}$) were determined. These quantities are plotted
in Figure 1. Notice that the likelihood was monotonically increasing,
as expected.

The algorithm was re-run using labeled data, i.e. only data from
class m was used to train $p(Z_m|H_m)$. The result is plotted in
Figure 2. As would be expected, $P_{cc}$ is higher, but the likelihood
is lower (not discernible on the graph). Using labeled data does not
necessarily maximize (5).
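$$p(Z_1|H_0) = \left(\frac{1}{N\sigma^2}\right) \exp\left\{-\frac{Z_1}{N\sigma^2}\right\}$$

$$p(Z_2|H_0) = \left(\frac{2}{N\sigma^2}\right) \exp\left\{-\frac{2 Z_2}{N\sigma^2}\right\}$$

$$p(Z_3|H_0) = \left(\frac{4}{N\sigma^2}\right) \exp\left\{-\frac{4 Z_3}{N\sigma^2}\right\}$$

$$p(Z_4|H_0) = \frac{1}{2\sigma^2}\, \Gamma^{-1}\!\left(\frac{N}{2}\right) \left(\frac{Z_4}{2\sigma^2}\right)^{\frac{N}{2}-1} \exp\left\{-\frac{Z_4}{2\sigma^2}\right\}$$

$$p(Z_5|H_0) = \ldots$$

$$p(Z_6|H_0) = (2\pi\sigma^2)^{-1/2} \exp\{\ldots\}$$

$$p(Z_7|H_0) = (4\pi\sigma^2)^{-1/2} \exp\{\ldots\}$$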

Table  3:  Distributions  of  Class-Specific  Statistics 


The PDF estimate of $p(Z_4|H_4)$ from unlabeled data is plotted in
Figure 3. Superimposed on the graph are the histograms of $Z_4$ for
all data classes and for just class 4. The fact that the PDF estimate
matches the histogram for class 4 illustrates that PDF estimates may
be obtained from unlabeled data.

5.  CONCLUSIONS 

An EM algorithm has been derived for the case when the input data is
a mixture density of several data classes, with each data class
dependent on a different set of parameters. By taking advantage of
different sufficient statistics for each data class, it is possible
to jointly estimate the parameters efficiently.
Figure 1: Algorithm Convergence Properties. Scaled log likelihood
values superimposed on a plot of $P_{cc}$.


Figure 2: Repeat of Figure 1 with labeled data used to train
$p(Z_m|H_m)$.


Figure 3: PDF estimation results for feature $Z_4$ using unlabeled
data. Graph includes histogram of data from all 7 classes (dotted),
histogram of data from class 4 (circles), PDF estimate for
$p(Z_4|H_4)$ (solid). Data is plotted on a normalized axis. The
designation “SUX1” is the name used to identify feature $Z_4$.